perm filename GALLEY.TEX[TEX,DEK]2 blob
sn#372432 filedate 1978-08-08 generic text, type C, neo UTF8
\input acphdr
\runninglefthead{ARITHMETIC---FIRST PROOFS $\copyright$ 1978}
\runningrighthead{ARITHMETIC---FIRST PROOFS $\copyright$ 1978}
\section{4.x}
\setcount0 701
\tenpoint
%folio 385 galley 14 (C) Addison-Wesley 1978 *
Sch\"onhage's paper [{\sl Computing \bf 1} (1966), 182--196]
shows that these ideas can be extended to the multiplication
of $n$-bit numbers using $r \approx 2↑{\sqrt{2\lg n}}$ moduli, obtaining
a method analogous to Algorithm C. We shall not dwell on the
details here, since Algorithm C is always superior; in fact,
an even better method is next on our agenda.
\subsectionbegin{C. Use of Fourier transforms} The
critical problem in high-precision multiplication is the determination
of ``convolution products'' such as
$$u↓rv↓0 + u↓{r-1}v↓1 + \cdots + u↓0v↓r,$$
and there is an intimate relation between
convolutions and finite Fourier transforms. If $\omega =\exp(2πi/K)$
is a $K$th root of unity, the one-dimensional Fourier transform
of $(u↓0, u↓1, \ldotss , u↓{K-1})$ may be defined to be $(
{\A u}↓0,{\A u}↓1, \ldotss , {\A u}↓{\!K-1})$, where
$${\A u}↓s = \sum ↓{0≤t<K}\omega ↑{st}u↓t,\qquad 0 ≤ s <
K.\eqno (32)$$
Letting $({\A v}↓0, {\A v}↓1, \ldotss ,
{\A v}↓{K-1})$ be defined in the same way, as the Fourier transform of $(v↓0,
v↓1, \ldotss , v↓{K-1})$, it is not difficult to see that $(
{\A u}↓0{\A v}↓0, {\A u}↓1{\A v}↓1, \ldotss , {\A u}↓{\!K-1}
{\A v}↓{\!K-1})$ is the transform of $(w↓0, w↓1, \ldotss , w↓{K-1})$,
where
$$\eqalign{w↓r⊗ = u↓rv↓0 + u↓{r-1}v↓1 + \cdots + u↓0v↓r + u↓{K-1}v↓{r+1}
+ \cdots + u↓{r+1}v↓{K-1}\cr
\noalign{\vskip 7pt}
⊗ = \sum ↓{i+j≡r\modulo K}u↓iv↓j.\cr}$$
When $K ≥ 2n-1$ and $u↓n = u↓{n+1} = \cdots = u↓{K-1}
= v↓n = v↓{n+1} = \cdots = v↓{K-1} = 0$, the $w$'s are just what we
need for multiplication; {\sl the transform of a convolution
product is the ordinary product of the transforms.} This idea
is actually a special case of Toom's use of polynomials $\biglp$cf.\ (10)$\bigrp
$, with $x$ replaced by roots of unity.
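The convolution theorem just stated is easy to check numerically. The following Python sketch (an editorial illustration, not part of the original text; the function names are ours) computes the transform (32) directly from its definition, using complex floating-point arithmetic, and verifies that the transform of a cyclic convolution equals the termwise product of the transforms:

```python
import cmath

def dft(a):
    """Naive Fourier transform (32): a_hat[s] = sum over t of w^(s*t) * a[t],
    where w = exp(2*pi*i/K)."""
    K = len(a)
    w = cmath.exp(2j * cmath.pi / K)
    return [sum(w ** (s * t) * a[t] for t in range(K)) for s in range(K)]

def cyclic_convolution(u, v):
    """w[r] = sum of u[i]*v[j] over all i + j congruent to r (mod K)."""
    K = len(u)
    return [sum(u[i] * v[(r - i) % K] for i in range(K)) for r in range(K)]

# Zero-padded so that K >= 2n - 1; the cyclic convolution is then the
# ordinary one, as in the text.
u = [3, 1, 4, 0, 0, 0, 0, 0]
v = [2, 7, 1, 0, 0, 0, 0, 0]
lhs = dft(cyclic_convolution(u, v))            # transform of the convolution
rhs = [a * b for a, b in zip(dft(u), dft(v))]  # product of the transforms
assert all(abs(x - y) < 1e-6 for x, y in zip(lhs, rhs))
```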
The above property of Fourier transforms was exploited by V.
Strassen in 1968, using a sufficiently precise binary representation
of the complex number $\omega $, to multiply large numbers faster
than was possible with any previously known scheme. In 1970,
he and A. Sch\"onhage found an elegant way to modify the
method, avoiding all the complications of complex numbers and
obtaining a very pretty algorithm capable of multiplying two
$n$-bit numbers in $O(n\log n\log\log n)$ steps. We shall now
study their remarkable approach [cf.\ {\sl Computing \bf 7} (1971),
281--292], in a simplified form suggested by V. R. Pratt.
It is convenient in the first place to replace $n$ by $2↑n$,
and to seek a procedure that multiplies $2↑n$-bit numbers in
$O(2↑n\,n\log n)$ steps. Roughly speaking, we shall reduce
the problem of multiplying $2↑n$-bit numbers to the problem
of doing about $2↑{n/2}$ multiplications of $2↑{n/2}$-bit numbers,
with $O(2↑n n)$ auxiliary steps required to piece these products
together properly; then there will be $\lg n$ levels of recursion
with $O(2↑n n)$ steps per level, making a total of $O(2↑n\,n
\log n)$ steps as desired.
Let $N = 2↑n$ and suppose we wish to compute the product of
$u$ and $v$, where $0 ≤ u, v < 2↑N$. As in Algorithm C we shall
break these $N$-bit numbers into groups; let
$$\baselineskip16pt\vjust{\halign{$\hfill\dispstyle{#}\hfill$\cr
k + l = n + 1,\cr
K = 2↑k,\qquad L = 2↑l,\cr}}$$
and write
$$u = (U↓{K/2-1} \ldotsm U↓1U↓0)↓{2↑L},\qquad v = (V↓{K/2-1}
\ldotsm V↓1V↓0)↓{2↑L},$$
regarding $u$ and $v$ as $2↑{k-1}$ groups of $2↑l$-bit
numbers. We will select appropriate values for $k$ and $l$ later;
it turns out (see exercise 10) that we will need to have
$$1 ≤ k ≤ l + 3 ≤ n,\eqno (33)$$
but no other conditions. The above representation
of $u$ and $v$ implies as before that
$$u \cdot v = W↓{K-2}2↑{(K-2)L} + \cdots + W↓12↑L + W↓0,\eqno
(34)$$
where
$$W↓r = \sum ↓{i+j=r}U↓iV↓j = \sum ↓{i+j≡r\modulo K}U↓iV↓j,\eqno
(35)$$
if we define $U↓i = V↓j = 0$ for $i, j ≥ K/2$.
Clearly $0 ≤ W↓r ≤ (K/2)(2↑L - 1)↑2 < 2↑{2L+k-1}$; therefore
if we knew the $W$'s, we could compute $u \cdot v$ by adding
up the terms in (34), in $O\biglp K(2L + k)\bigrp = O(N)$ further
steps.
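As a concrete instance of (34) and (35) (an editorial illustration with toy numbers of our own choosing, with $L = 2$ so the radix is $2↑L = 4$), the following Python fragment forms the convolution products of the digit groups and reassembles the full product from them:

```python
B = 4                    # radix 2^L, with L = 2 bits per group
U = [2, 3, 1, 0]         # u = (U_3 U_2 U_1 U_0) in radix B
V = [1, 2, 0, 3]         # v = (V_3 V_2 V_1 V_0) in radix B
u = sum(d * B ** i for i, d in enumerate(U))
v = sum(d * B ** i for i, d in enumerate(V))

# Convolution products (35); top halves are implicitly zero, so the
# cyclic and ordinary convolutions agree.
W = [sum(U[i] * V[r - i] for i in range(len(U)) if 0 <= r - i < len(V))
     for r in range(2 * len(U) - 1)]

# Reassembly (34): add the W_r after shifting each by r*L bits.
assert u * v == sum(w * B ** r for r, w in enumerate(W))
```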
Our goal is to compute the $W↓r$ exactly; and we can do this
by computing their value mod $M$, where $M$ is any number larger
than $(K/2)(2↑L - 1)↑2$. The key idea is that we can choose
$M = 2↑{4L} + 1$, and compute the $W↓r$ by doing a ``fast Fourier
transform'' modulo $M$, where the $K$th root of unity $\omega$
we use is a power of 2 (so that multiplication by powers of
$\omega$ is very simple).
Before discussing this idea in detail, a numerical example of
what we've said so far may help to fix the ideas. Suppose that
we want to multiply two 4096-bit numbers, obtaining an 8192-bit
product; thus $n = 12$ in the above discussion, and we may choose
$k = 8$, $l = 5$. The bits of $u$ and $v$ are partitioned into
128 groups of 32 bits each, and the basic idea is to find the
256 convolution products (35) and to add them together (after
appropriate shifting). The convolution products have at most
$64 + 7 = 71$ bits each, so it surely suffices to determine them
modulo $M = 2↑{128} + 1$. We will see that it is possible to
find the convolution products rapidly by first computing their
finite Fourier transform mod $M$, using the integer $\omega
= 2$ as a 256th ``root of unity.'' These integer calculations
mod $M$ turn out to have all the necessary properties of complex
roots of unity in the ordinary Fourier transform (32).
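The root-of-unity claims in this worked example can be confirmed with plain integer arithmetic; the following check (ours, not in the original text) verifies that $\omega = 2$ has the required properties modulo $M = 2↑{128} + 1$, and that the 71-bit convolution products indeed fit below $M$:

```python
# Parameters of the worked example: n = 12, k = 8, l = 5.
M = 2 ** 128 + 1
K = 256

assert pow(2, K, M) == 1           # w^K = 2^256 is congruent to 1 (mod M)
assert pow(2, K // 2, M) == M - 1  # w^(K/2) is congruent to -1, as (39) needs
assert 2 ** 71 < M                 # every 71-bit W_r is determined exactly mod M
```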
Arithmetic mod $(2↑m + 1)$ is somewhat similar to ones' complement
arithmetic, mod $(2↑m - 1)$, although it is slightly more complicated;
we have already investigated the idea briefly in Section 3.2.1.1.
Numbers can be represented as $m$-bit quantities in binary notation,
except for the special value $-1 ≡ 2↑m$, which may be represented
in some special way. Addition mod $(2↑m + 1)$
is easily done in $O(m)$ cycles, since
a carry off the left end merely means that we must subtract 1 at
the right; similarly, subtraction mod $(2↑m + 1)$ is quite simple.
Furthermore, we can multiply by $2↑r$ in $O(m)$ cycles, when
$0 ≤ r < m$, since
$$2↑r \cdot (u↓{m-1} \ldotsm u↓1u↓0)↓2 ≡ (u↓{m-r-1} \ldotsm
u↓0\,0 \ldotsm 0)↓2 - (0 \ldotsm 0\,u↓{m-1}\ldotsm u↓{m-r})↓2.\eqno(36)$$
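Rule (36) translates directly into shift-and-subtract code. Here is a small Python sketch (the function name is ours); it multiplies by $2↑r$ modulo $2↑m + 1$ using only shifts, a mask, and one modular subtraction, and is checked against ordinary modular multiplication:

```python
def shift_mod(u, r, m):
    """Compute (2^r * u) mod (2^m + 1) by the shift rule (36).
    Assumes 0 <= u < 2^m and 0 <= r < m."""
    M = (1 << m) + 1
    low = (u << r) & ((1 << m) - 1)   # (u_{m-r-1} ... u_0 0 ... 0)_2
    high = u >> (m - r)               # (0 ... 0 u_{m-1} ... u_{m-r})_2
    return (low - high) % M

# Exhaustive agreement with ordinary modular multiplication for m = 8:
m = 8
for u in range(1 << m):
    for r in range(m):
        assert shift_mod(u, r, m) == (u << r) % ((1 << m) + 1)
```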
%folio 388 galley 15 (C) Addison-Wesley 1978 *
Given a sequence of $K = 2↑k$ integers
$(a↓0, \ldotss , a↓{K-1})$, and an integer $\omega$ such that
$\omega ↑K ≡ 1\modulo M$, the integer Fourier transform
$${\A a}↓s =\bigglp\sum ↓{0≤t<K}\omega ↑{st}a↓t\biggrp\mod M,\qquad
0 ≤ s < K,\eqno (37)$$
can be calculated rapidly as follows. (In these
formulas the $s↓j$ and $t↓j$ are either 0 or 1, so that each
step represents $2↑k$ computations.)
\yskip\noindent Step 0.\xskip Let $A↑{[0]}(t↓{k-1},
\ldotss , t↓0) = a↓t$, where $t = (t↓{k-1} \ldotsm t↓0)↓2$.
\yskip\noindent Step 1.\xskip Set $A↑{[1]}(s↓{k-1},t↓{k-2}, \ldotss , t↓0)←$
\penalty1000\vskip2pt
\hjust to size{\hfill$ \biglp A↑{[0]}(0, t↓{k-2}, \ldotss , t↓0) +
\omega ↑{(s↓{k-1}0\ldotsm0)↓2} \cdot A↑{[0]}(1, t↓{k-2}, \ldotss
, t↓0)\bigrp \mod M$.}
\yskip\noindent Step 2.\xskip Set $A↑{[2]}(s↓{k-1},
s↓{k-2}, t↓{k-3}, \ldotss , t↓0)←$
\penalty1000\vskip2pt
\hjust to size{\hfill$ \biglp A↑{[1]}(s↓{k-1}, 0, t↓{k-3}, \ldotss
, t↓0) + \omega ↑{(s↓{k-2}s↓{k-1}0 \ldotsm 0)↓2} \cdot A↑{[1]}(s↓{k-1},
1, t↓{k-3}, \ldotss , t↓0)\bigrp \mod M$.}
\yskip$\ldots$
\yskip\noindent Step $k$.\xskip Set $A↑{[k]}(s↓{k-1},
\ldotss , s↓1, s↓0)←$
\penalty1000\vskip2pt
\hjust to size{\hfill$\biglp A↑{[k-1]}(s↓{k-1}, \ldotss , s↓1, 0)
+ \omega ↑{(s↓0s↓1 \ldotsm s↓{k-1})↓2} \cdot A↑{[k-1]}(s↓{k-1},
\ldotss , s↓1, 1)\bigrp \mod M$.}
\yskip\noindent It is not difficult to prove by induction that
we have
$$\twoline{A↑{[j]}(s↓{k-1}, \ldotss , s↓{k-j}, t↓{k-j-1}, \ldotss
, t↓0)}{7pt}{\null=\sum ↓{0≤t↓{k-1}, \ldotss , t↓{k-j}≤1}\omega
↑{(s↓0s↓1 \ldotsm s↓{k-1})↓2 \cdot (t↓{k-1} \ldotsm t↓{k-j}0 \ldotsm
0)↓2} a↓t \mod M,}$$
so that
$$A↑{[k]}(s↓{k-1}, \ldotss , s↓1, s↓0) = {\A a}↓s,\qquad
\hjust{where $s = (s↓0s↓1 \ldotsm s↓{k-1})↓2$.}$$
(Note the reversed order of the binary digits in
$s$. For further discussion of transforms such as this, see
Section 4.6.4.)
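Steps 0 through $k$ above can be transcribed almost literally into code. The following Python sketch (ours; an illustration, not an efficient implementation) stores $A↑{[j]}$ in an array indexed by the $k$ bits named in the text, forms the exponent $(s↓{k-j} \ldotsm s↓{k-1}0 \ldotsm 0)↓2$ at step $j$, and checks the bit-reversed output against the defining sum (37) for a small modulus:

```python
def fft_mod(a, omega, M):
    """Steps 0..k of the text's integer transform.  Entry x of the result
    holds a_hat[s] for s = bitrev(x), per the remark after the algorithm."""
    K = len(a)
    k = K.bit_length() - 1
    A = [x % M for x in a]
    for j in range(1, k + 1):
        pos = k - j                    # index bit that becomes s_{k-j} at step j
        B = [0] * K
        for x in range(K):
            # exponent (s_{k-j} s_{k-j+1} ... s_{k-1} 0 ... 0)_2 of step j
            e = sum(((x >> (pos + i)) & 1) << (k - 1 - i) for i in range(j))
            B[x] = (A[x & ~(1 << pos)] + pow(omega, e, M) * A[x | (1 << pos)]) % M
        A = B
    return A

def bitrev(x, k):
    """Reverse the k-bit binary representation of x."""
    return int(format(x, f'0{k}b')[::-1], 2)

# Toy check: M = 17, omega = 4 (so omega^4 = 1 and omega^2 = -1 mod M), K = 4.
M, omega, a = 17, 4, [1, 2, 3, 4]
K, k = len(a), 2
out = fft_mod(a, omega, M)
naive = [sum(pow(omega, s * t, M) * a[t] for t in range(K)) % M for s in range(K)]
assert all(out[x] == naive[bitrev(x, k)] for x in range(K))
```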
Now we have enough machinery at
our disposal to do the calculation of all $W↓r$ as promised.
Let $\omega = 2↑{2↑{l+3-k}}$, so that $\omega ↑K = 2↑{8L}
≡ 1\modulo M$, where $M = 2↑{4L} + 1$. The integer fast
Fourier transform algorithm above can be applied to $(U↓0, \ldotss
, U↓{K-1})$ to obtain $({\A U}↓{\!0}, \ldotss , {\A U}↓{\!K-1})$;
each of the $k$ steps involves $2↑k$ computations of the form
$c = (a + 2↑eb)\mod M$, so the running time is $O(k2↑kL) =
O(kN)$. Similarly we obtain $({\A V}↓{\!0}, \ldotss , {\A V}↓{\!K-1})$
in $O(kN)$ steps. The next step is to compute
$$(a↓0, a↓1, \ldotss , a↓{K-1}) = (U↓0V↓0, U↓1V↓1, \ldotss ,
U↓{K-1}V↓{K-1})\mod M,$$
using a high-speed multiplication procedure for
each of these products, obtaining the results mod $M$ by subtracting
the most significant halves from the least significant halves.
If we now use the fast Fourier transform a third time, obtaining
$({\A a}↓0, {\A a}↓1, \ldotss , {\A a}↓{K-1})$, this
is enough to determine $(W↓0, W↓1, \ldotss , W↓{K-1})$ without
much more work, since we shall prove that
$$2↑kW↓r ≡ {\A a}↓{(-r)\mod K}\modulo M.\eqno (38)$$
This congruence means that an appropriate shifting
operation, namely to multiply $-{\A a}↓{(-r)\mod K}$ by $2↑{4L-k}\mod
M$ as in (36), finally yields $W↓r$.
All this may seem like magic, but it works;
a careful study of the above remarks will show that the method
is very clever but not a complete mystery. The proof of (38)
relies primarily on the fact that $\omega ↑{K/2} ≡ -1\modulo
M$, because this fact can be used to prove that
$$\sum ↓{0≤t<K}\omega ↑{st} = \left\{\vcenter{\baselineskip14pt
\halign{$#$,\hfill⊗\qquad if $s\mod K#$\hfill\cr
K⊗=0;\cr 0⊗≠0.\cr}}\right.\eqno(39)$$
For when $s \mod K ≠ 0$, let $s \mod K = 2↑pq$
where $q$ is odd and $0 ≤ p < k$. Setting $T = 2↑{k-1-p}$, we
have $\omega ↑{sT} ≡ \omega ↑{qK/2} ≡ -1$; hence $\omega ↑{s(t+T)}
≡ -\omega↑{st}$, and terms $T$ steps apart in the sum (39) cancel in pairs.
From (39) we deduce (38) as follows:
$$\eqalign{{\A a}↓s\;⊗≡\;\sum↓{0≤t<K}\omega↑{st}{\A U}↓{\!t}{\A V}↓{\!t}\;≡\;\sum
↓{0≤t,i,j<K}\omega↑{st}\,\omega↑{ti}U↓i\;\omega↑{tj}V↓j\cr
\noalign{\vskip7pt}
⊗\;≡\sum↓{0≤i,j<K}U↓iV↓j\sum↓{0≤t<K}\omega↑{(s+i+j)t}\;≡\;K
\sum↓{\scriptstyle0≤i,j<K\atop\scriptstyle i+j+s≡0\modulo K}U↓iV↓j.\cr}$$
The multiplication procedure is
nearly complete; it remains for us to specify $k$ and $l$, and
to total up the amount of work involved. Let $M(n)$ denote the
time it takes to multiply $2↑n$-bit numbers by the above method,
and let $M↑\prime (n) = M(n)/2↑n$. The calculation time involves
$O(kN)$ steps for the three Fourier transforms and the other
auxiliary operations, plus $2↑k$ multiplications of
integers in the interval $[0, 2↑{4L}]$; hence we have
$$M(n) = 2↑kM(l + 2) + O(kN);\qquad M↑\prime (n) = 2M↑\prime
(l + 2) + O(k).$$
We get the best reduction of $M↑\prime (n)$ when
$l$ is chosen to be as low as possible, consistent with (33),
so we set
$$k = \lfloor n/2\rfloor + 2,\qquad l = \lceil n/2\rceil -
1.\eqno (40)$$
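A quick sanity check (ours, not in the text) confirms that the choice (40) satisfies both $k + l = n + 1$ and the constraint (33) for every relevant $n$:

```python
# Verify (40) against (33) and k + l = n + 1 for a range of n >= 4.
for n in range(4, 60):
    k = n // 2 + 2          # floor(n/2) + 2
    l = (n + 1) // 2 - 1    # ceil(n/2) - 1
    assert k + l == n + 1
    assert 1 <= k <= l + 3 <= n
```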
We have proved that there is a constant
$C$ such that
$$M↑\prime (n) ≤ 2M↑\prime (\lceil (n - 2)/2\rceil + 2) + Cn,\qquad
\hjust{for all $n ≥ 4$.}$$
Iterating this relation (cf.\ exercise 1.2.4--35)
yields
$$M↑\prime (n) ≤ 2↑jM↑\prime (\lceil (n - 2)/2↑j\rceil + 2)
+ C(2↑{j-1}j\lceil (n - 2)/2↑{j-1}\rceil + 2↑{j+1} - 2),$$
for $j = 1$, 2, $\ldotss$, $\lceil\lg (n - 2)\rceil
$; and $j = \lceil\lg(n - 2)\rceil$ yields $M↑\prime (n)
= O(n\log n)$. We have proved the main result of this section:
\algbegin Theorem S ({\rm A. Sch\"onhage, V. Strassen}).
{\sl It is possible to multiply two $n$-bit numbers in $O(n\log
n\log\log n)$ steps.}
\yyskip
Our formulation of the multiplication
procedure was designed primarily for simplicity of exposition;
it does not turn out to be an especially fast method for small
$n$; for example, a lot of the bits that the above method deals
with are known to be zero. Thus the algorithm needs to be refined
somewhat if it is ever to become competitive with Algorithm
C when $n$ is in a practical range. As $n → ∞$, of course, fast
Fourier multiplication becomes vastly superior to Algorithm
C. John Pollard has presented a fast Fourier multiplication
algorithm that is useful for moderately large $n$, in {\sl
Math.\ Comp.\ \bf 25} (1971), 365--374.
\vfill\eject
\setcount0 801
\runninglefthead{ANSWERS TO EXERCISES---FIRST PROOFS $\copyright$ 1978}
\runningrighthead{ANSWERS TO EXERCISES---FIRST PROOFS $\copyright$ 1978}
\section{4.x}
\ninepoint